{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regressão Linear Simples - Trabalho"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estudo de caso: Seguro de automóvel sueco"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Agora, sabemos como implementar um modelo de regressão linear simples. Vamos aplicá-lo ao conjunto de dados do seguro de automóveis sueco. Esta seção assume que você baixou o conjunto de dados para o arquivo insurance.csv, o qual está disponível no notebook respectivo.\n",
    "\n",
    "\n",
    "O conjunto de dados envolve a previsão do pagamento total de todas as reclamações em milhares de Kronor sueco, dado o número total de reclamações. É um dataset composto por 63 observações com 1 variável de entrada e 1 variável de saída. Os nomes das variáveis são os seguintes:\n",
    "\n",
    "1. Número de reivindicações.\n",
    "2. Pagamento total para todas as reclamações em milhares de Kronor sueco.\n",
    "\n",
    "\n",
    "Voce deve adicionar algumas funções acessórias à regressão linear simples. Especificamente, uma função para carregar o arquivo CSV chamado load_csv (), uma função para converter um conjunto de dados carregado para números chamado str_column_to_float (), uma função para avaliar um algoritmo usando um conjunto de treino e teste chamado split_train_split (), a função para calcular RMSE chamado rmse_metric () e uma função para avaliar um algoritmo chamado evaluate_algorithm().\n",
    "\n",
    "\n",
    "Utilize um conjunto de dados de treinamento de 60% dos dados para preparar o modelo. As previsões devem ser feitas nos restantes 40%.\n",
    "\n",
    "\n",
    "Compare a performance do seu algoritmo com o algoritmo baseline, o qual utiliza a média dos pagamentos realizados para realizar a predição ( a média é 72,251 mil Kronor)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyRegressor\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import ShuffleSplit\n",
    "import pandas as pd\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from random import randrange\n",
    "from csv import reader\n",
    "from math import sqrt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regressao Linear"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def mean(values):\n",
    "    return sum(values) / float(len(values))\n",
    "\n",
    "def coefficients(dataset):\n",
    "    x = [row[0] for row in dataset]\n",
    "    y = [row[1] for row in dataset]\n",
    "    x_mean, y_mean = mean(x), mean(y)\n",
    "    b2 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)\n",
    "    b1 = y_mean - b2 * x_mean\n",
    "    return [b1, b2]\n",
    "\n",
    "def variance(values, mean):\n",
    "    return sum([(x-mean)**2 for x in values])\n",
    "\n",
    "def covariance(x, mean_x, y, mean_y):\n",
    "    covar = 0.0\n",
    "    for i in range(len(x)):\n",
    "        covar += (x[i] - mean_x) * (y[i] - mean_y)\n",
    "    return covar\n",
    "\n",
    "\n",
    "def str_column_to_float(dataset, column):\n",
    "    for row in dataset:\n",
    "        row[column] = float(row[column].strip())\n",
    "\n",
    "\n",
    "def load_csv(filename):\n",
    "    dataset = list()\n",
    "    with open(filename, 'r') as file:\n",
    "        csv_reader = reader(file)\n",
    "        for row in csv_reader:\n",
    "            if not row:\n",
    "                continue\n",
    "            dataset.append(row)\n",
    "    return dataset\n",
    "\n",
    "\n",
    "def train_test_split(dataset, split):\n",
    "    train = list()\n",
    "    train_size = split * len(dataset)\n",
    "    dataset_copy = list(dataset)\n",
    "    while len(train) < train_size:\n",
    "        index = randrange(len(dataset_copy))\n",
    "        train.append(dataset_copy.pop(index))\n",
    "    return train, dataset_copy\n",
    "\n",
    "def rmse_metric(actual, predicted):\n",
    "    sum_error = 0.0\n",
    "    for i in range(len(actual)):\n",
    "        prediction_error = predicted[i] - actual[i]\n",
    "        sum_error += (prediction_error ** 2)\n",
    "    mean_error = sum_error / float(len(actual))\n",
    "    return sqrt(mean_error)\n",
    "\n",
    "def simple_linear_regression(train, test):\n",
    "    predictions = list()\n",
    "    b1, b2 = coefficients(train)\n",
    "    for row in test:\n",
    "        yhat = b1 + b2 * row[0]\n",
    "        predictions.append(yhat)\n",
    "    return predictions\n",
    "\n",
    "def evaluate_algorithm(dataset, algorithm, split, *args):\n",
    "    train, test = train_test_split(dataset, split)\n",
    "    test_set = list()\n",
    "    for row in test:\n",
    "        row_copy = list(row)\n",
    "        row_copy[-1] = None\n",
    "        test_set.append(row_copy)\n",
    "    predicted = algorithm(train, test_set, *args)\n",
    "    actual = [row[-1] for row in test]\n",
    "    rmse = rmse_metric(actual, predicted)\n",
    "    return rmse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ds = load_csv('insurance.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in range(len(ds[0])):\n",
    "    str_column_to_float(ds, i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE: 32.121\n"
     ]
    }
   ],
   "source": [
    "split = 0.6\n",
    "rmse = evaluate_algorithm(ds, simple_linear_regression, split)\n",
    "print('RMSE: %.3f' % (rmse))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Algoritmo Baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"insurance.csv\", header=None, names=['n_rei', 'pgtTot'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dm = DummyRegressor()\n",
    "param_grid = {\"strategy\": [\"mean\", \"median\"]}\n",
    "ss = ShuffleSplit(n_splits=1, test_size=.4, random_state=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cv = GridSearchCV(dm, cv=ss, param_grid=param_grid, scoring=\"neg_mean_squared_error\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8787.5998097994034"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cv.fit(df[['n_rei']], df['pgtTot'])\n",
    "\n",
    "cv.best_score_ * -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
